Depression is a mood disorder that affects more than 264 million people globally, making it one of the leading causes of disability worldwide. It's characterized by symptoms such as profound sadness, feelings of emptiness, anxiety, sleep disturbance, and a general loss of initiative and interest in activities. The severity of depression is determined by the number of symptoms, their seriousness and duration, and their consequences for social and occupational function.
One way to classify depression is as unipolar or bipolar: unipolar depression refers to major depressive disorder, while bipolar depression is a facet of bipolar disorder. Both are heritable mood disorders and share symptoms, but a distinction should be made between the two: bipolar depression is unique in the periodic occurrence of mania, a state associated with inflated self-esteem, impulsivity, increased activity, goal-directed actions, and reduced sleep.
Although there are known, effective treatments for mental disorders, between 76% and 85% of people in low- and middle-income countries receive no treatment for their disorder. One barrier to effective care is inaccurate assessment: in countries of all income levels, people who are depressed are often not correctly diagnosed.
Actigraphs are small motion sensors (accelerometers) encased in a unit about the size of a wristwatch, which can be worn continuously for days to months. It is well established that depression is characterized by altered motor activity, and actigraph recordings of motor activity are considered an objective method for observing depression. Although the topic has not yet been exhaustively studied, there is increasing awareness in psychiatry of how activity data relates to various mental health issues, such as changes in mood or personality, inability to cope with daily problems, stress, and withdrawal from friends and activities.
In the following tutorial, we will walk through the data science pipeline to see whether depression states can be accurately predicted from the sensor data recorded by actigraphs.
We'll be looking at actigraph data originally collected for a study on motor activity in schizophrenia and major depression. Actigraphs continuously record an activity count proportional to the intensity of movement in one-minute intervals. The dataset consists of actigraphy data collected for the condition group (23 unipolar and bipolar depressed patients) as well as the control group (32 non-depressed contributors). We'll be using the Montgomery-Asberg Depression Rating Scale (MADRS) score included in the data for each participant to identify the severity of an ongoing depression. The score is based on ten items relevant to depression, which clinicians rate based on observation and conversation with the patient. The sum score (0-60) represents the severity: scores below 10 are classified as an absence of depressive symptoms, and scores above 30 indicate a severe depressive state.
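As a quick aside, the MADRS cutoffs just described are easy to encode; here's a minimal sketch (the "mild to moderate" label for the middle band is our own shorthand, not a clinical category from the dataset):

```python
def madrs_severity(score):
    """Classify a MADRS sum score (0-60) using the cutoffs quoted above."""
    if not 0 <= score <= 60:
        raise ValueError('MADRS sum scores range from 0 to 60')
    if score < 10:
        return 'no depressive symptoms'  # below 10: absence of symptoms
    if score > 30:
        return 'severe'                  # above 30: severe depressive state
    return 'mild to moderate'            # everything in between (our label)

print(madrs_severity(7))   # no depressive symptoms
print(madrs_severity(24))  # mild to moderate
```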
First let's import the libraries we'll use throughout the tutorial.
# for data
import os
import glob
import pandas as pd
import datetime as dt
from datetime import time
# for plotting
import seaborn as sns
import missingno as msno
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# for machine learning
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.linear_model import HuberRegressor
from IPython.display import display, Math
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from numpy import mean
from numpy import std
from numpy import absolute
import numpy as np
# other
from IPython.display import Image
import warnings
The dataset contains a scores.csv file with demographic and clinical data for each participant:
# skipinitialspace skips spaces after the delimiter ',' (allows blank cells to be filled with NaN)
scores = pd.read_csv('data/scores.csv', skipinitialspace=True)
display(scores)
| | number | days | gender | age | afftype | melanch | inpatient | edu | marriage | work | madrs1 | madrs2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | condition_1 | 11 | 2 | 35-39 | 2 | 2 | 2 | 6-10 | 1 | 2 | 19 | 19 |
| 1 | condition_2 | 18 | 2 | 40-44 | 1 | 2 | 2 | 6-10 | 2 | 2 | 24 | 11 |
| 2 | condition_3 | 13 | 1 | 45-49 | 2 | 2 | 2 | 6-10 | 2 | 2 | 24 | 25 |
| 3 | condition_4 | 13 | 2 | 25-29 | 2 | 2 | 2 | 11-15 | 1 | 1 | 20 | 16 |
| 4 | condition_5 | 13 | 2 | 50-54 | 2 | 2 | 2 | 11-15 | 2 | 2 | 26 | 26 |
| 5 | condition_6 | 7 | 1 | 35-39 | 2 | 2 | 2 | 6-10 | 1 | 2 | 18 | 15 |
| 6 | condition_7 | 11 | 1 | 20-24 | 1 | NaN | 2 | 11-15 | 2 | 1 | 24 | 25 |
| 7 | condition_8 | 5 | 2 | 25-29 | 2 | NaN | 2 | 11-15 | 1 | 2 | 20 | 16 |
| 8 | condition_9 | 13 | 2 | 45-49 | 1 | NaN | 2 | 6-10 | 1 | 2 | 26 | 26 |
| 9 | condition_10 | 9 | 2 | 45-49 | 2 | 2 | 2 | 6-10 | 1 | 2 | 28 | 21 |
| 10 | condition_11 | 14 | 1 | 45-49 | 2 | 2 | 2 | 6-10 | 1 | 2 | 24 | 24 |
| 11 | condition_12 | 12 | 2 | 40-44 | 1 | 2 | 2 | 6-10 | 2 | 2 | 25 | 21 |
| 12 | condition_13 | 14 | 2 | 35-39 | 1 | 2 | 2 | 11-15 | 2 | 2 | 18 | 13 |
| 13 | condition_14 | 14 | 1 | 60-64 | 1 | 2 | 2 | 6-10 | 2 | 2 | 28 | 19 |
| 14 | condition_15 | 13 | 2 | 55-59 | 2 | 2 | 2 | 11-15 | 1 | 1 | 14 | 18 |
| 15 | condition_16 | 16 | 1 | 45-49 | 2 | 2 | 2 | 11-15 | 1 | 2 | 13 | 17 |
| 16 | condition_17 | 13 | 1 | 50-54 | 1 | 2 | 2 | 6-10 | 1 | 2 | 17 | 15 |
| 17 | condition_18 | 13 | 2 | 40-44 | 3 | 2 | 2 | 11-15 | 2 | 2 | 18 | 15 |
| 18 | condition_19 | 13 | 2 | 50-54 | 2 | 2 | 1 | 16-20 | 2 | 2 | 26 | 21 |
| 19 | condition_20 | 13 | 1 | 30-34 | 2 | 1 | 1 | 6-10 | 1 | 2 | 27 | 25 |
| 20 | condition_21 | 13 | 2 | 35-39 | 2 | 2 | 1 | 6-10 | 2 | 2 | 26 | 21 |
| 21 | condition_22 | 14 | 1 | 65-69 | 2 | 2 | 1 | NaN | 2 | 2 | 29 | 28 |
| 22 | condition_23 | 16 | 1 | 30-34 | 2 | 2 | 1 | 16-20 | 2 | 2 | 29 | 23 |
| 23 | control_1 | 8 | 2 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24 | control_2 | 20 | 1 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25 | control_3 | 12 | 2 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 26 | control_4 | 13 | 1 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 27 | control_5 | 13 | 1 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 28 | control_6 | 13 | 1 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 29 | control_7 | 13 | 1 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 30 | control_8 | 13 | 2 | 40-44 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 31 | control_9 | 13 | 2 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 32 | control_10 | 8 | 1 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 33 | control_11 | 13 | 1 | 45-49 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 34 | control_12 | 14 | 1 | 60-64 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 35 | control_13 | 13 | 1 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 36 | control_14 | 13 | 1 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 37 | control_15 | 11 | 1 | 45-49 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 38 | control_16 | 13 | 2 | 40-44 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 39 | control_17 | 9 | 1 | 45-49 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 40 | control_18 | 13 | 2 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 41 | control_19 | 13 | 1 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 42 | control_20 | 13 | 1 | 35-39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 43 | control_21 | 8 | 1 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 44 | control_22 | 13 | 1 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 45 | control_23 | 13 | 1 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 46 | control_24 | 13 | 2 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 47 | control_25 | 13 | 1 | 65-69 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 48 | control_26 | 13 | 1 | 35-39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 49 | control_27 | 13 | 2 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50 | control_28 | 16 | 2 | 45-49 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 51 | control_29 | 13 | 2 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 52 | control_30 | 9 | 2 | 35-39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 53 | control_31 | 13 | 1 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 54 | control_32 | 14 | 2 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Right away we can see there is a lot of missing data, mostly the depression-related fields for the control group. We expect these to be missing, because depression data was only collected for the condition group; thus it's considered missing at random (MAR). The control group is also missing data for education, marriage, and work. This data could also be considered MAR, because it looks like only number, days, gender, and age were collected for the control group; the data is missing precisely because the participant is in the control group. Since we'll be focusing mainly on the actigraph data for this group, it shouldn't be a concern for the rest of the tutorial. For the condition group, there are three missing melancholia scores and one missing education range value. It's not immediately clear which kind of missing data these are; after the following step, we'll take a closer look to see if there's any correlation among the condition group's missing values.
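Before labeling data MAR or MCAR, it helps to quantify the missingness per column. A small sketch on a toy frame (the values are illustrative, not taken from scores.csv):

```python
import pandas as pd
import numpy as np

# toy frame mirroring a few of scores.csv's columns (illustrative values only)
toy = pd.DataFrame({
    'number':  ['condition_1', 'condition_2', 'control_1'],
    'melanch': [2.0, np.nan, np.nan],
    'madrs1':  [19.0, 24.0, np.nan],
})

missing_counts = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()   # fraction of rows missing per column
print(missing_counts)
print(missing_share)
```

Running the same two lines on the real scores DataFrame gives a per-column missingness summary.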
Notes
skipinitialspace=True removes the whitespace preceding non-missing values. For our data, we don't need to preserve whitespace, so it doesn't cause any issues. Also note that pandas stores columns containing missing values as floats (the default integer dtype can't represent NaN). However, we can still display the values as integers, without modifying the actual data, using pandas display options.

Since we'll first be looking at differences between the condition and control groups, let's split the scores data into a control group DataFrame and a condition group DataFrame. Even though we could use the numeric indices to select a subset of the DataFrame, when there are many rows or the partition of indices isn't clear, it can be more robust to select rows based on column values (e.g. control or condition) as follows.
# select rows whose number column starts with either control or condition, and make DataFrames
# (.copy() avoids SettingWithCopyWarning when we modify the subsets below)
control_scores = scores.loc[scores['number'].str.startswith('control')].copy()
condition_scores = scores.loc[scores['number'].str.startswith('condition')].copy()
# make the index correspond to the participant number (1-based)
control_scores.reset_index(drop=True, inplace=True)
control_scores.index += 1
condition_scores.index += 1
# display control group data
print('\nControl Group:')
display(control_scores)
# format floats to display as int, add title, and display condition group data
print('\n\nCondition Group:')
pd.options.display.float_format = '{:,.0f}'.format
display(condition_scores)
Control Group:
| | number | days | gender | age | afftype | melanch | inpatient | edu | marriage | work | madrs1 | madrs2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | control_1 | 8 | 2 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | control_2 | 20 | 1 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | control_3 | 12 | 2 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | control_4 | 13 | 1 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | control_5 | 13 | 1 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | control_6 | 13 | 1 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | control_7 | 13 | 1 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | control_8 | 13 | 2 | 40-44 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | control_9 | 13 | 2 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | control_10 | 8 | 1 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 11 | control_11 | 13 | 1 | 45-49 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12 | control_12 | 14 | 1 | 60-64 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 13 | control_13 | 13 | 1 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 14 | control_14 | 13 | 1 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 15 | control_15 | 11 | 1 | 45-49 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 16 | control_16 | 13 | 2 | 40-44 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 17 | control_17 | 9 | 1 | 45-49 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 18 | control_18 | 13 | 2 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 19 | control_19 | 13 | 1 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 20 | control_20 | 13 | 1 | 35-39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 21 | control_21 | 8 | 1 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 22 | control_22 | 13 | 1 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 23 | control_23 | 13 | 1 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24 | control_24 | 13 | 2 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25 | control_25 | 13 | 1 | 65-69 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 26 | control_26 | 13 | 1 | 35-39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 27 | control_27 | 13 | 2 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 28 | control_28 | 16 | 2 | 45-49 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 29 | control_29 | 13 | 2 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 30 | control_30 | 9 | 2 | 35-39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 31 | control_31 | 13 | 1 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 32 | control_32 | 14 | 2 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Condition Group:
| | number | days | gender | age | afftype | melanch | inpatient | edu | marriage | work | madrs1 | madrs2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | condition_1 | 11 | 2 | 35-39 | 2 | 2 | 2 | 6-10 | 1 | 2 | 19 | 19 |
| 2 | condition_2 | 18 | 2 | 40-44 | 1 | 2 | 2 | 6-10 | 2 | 2 | 24 | 11 |
| 3 | condition_3 | 13 | 1 | 45-49 | 2 | 2 | 2 | 6-10 | 2 | 2 | 24 | 25 |
| 4 | condition_4 | 13 | 2 | 25-29 | 2 | 2 | 2 | 11-15 | 1 | 1 | 20 | 16 |
| 5 | condition_5 | 13 | 2 | 50-54 | 2 | 2 | 2 | 11-15 | 2 | 2 | 26 | 26 |
| 6 | condition_6 | 7 | 1 | 35-39 | 2 | 2 | 2 | 6-10 | 1 | 2 | 18 | 15 |
| 7 | condition_7 | 11 | 1 | 20-24 | 1 | NaN | 2 | 11-15 | 2 | 1 | 24 | 25 |
| 8 | condition_8 | 5 | 2 | 25-29 | 2 | NaN | 2 | 11-15 | 1 | 2 | 20 | 16 |
| 9 | condition_9 | 13 | 2 | 45-49 | 1 | NaN | 2 | 6-10 | 1 | 2 | 26 | 26 |
| 10 | condition_10 | 9 | 2 | 45-49 | 2 | 2 | 2 | 6-10 | 1 | 2 | 28 | 21 |
| 11 | condition_11 | 14 | 1 | 45-49 | 2 | 2 | 2 | 6-10 | 1 | 2 | 24 | 24 |
| 12 | condition_12 | 12 | 2 | 40-44 | 1 | 2 | 2 | 6-10 | 2 | 2 | 25 | 21 |
| 13 | condition_13 | 14 | 2 | 35-39 | 1 | 2 | 2 | 11-15 | 2 | 2 | 18 | 13 |
| 14 | condition_14 | 14 | 1 | 60-64 | 1 | 2 | 2 | 6-10 | 2 | 2 | 28 | 19 |
| 15 | condition_15 | 13 | 2 | 55-59 | 2 | 2 | 2 | 11-15 | 1 | 1 | 14 | 18 |
| 16 | condition_16 | 16 | 1 | 45-49 | 2 | 2 | 2 | 11-15 | 1 | 2 | 13 | 17 |
| 17 | condition_17 | 13 | 1 | 50-54 | 1 | 2 | 2 | 6-10 | 1 | 2 | 17 | 15 |
| 18 | condition_18 | 13 | 2 | 40-44 | 3 | 2 | 2 | 11-15 | 2 | 2 | 18 | 15 |
| 19 | condition_19 | 13 | 2 | 50-54 | 2 | 2 | 1 | 16-20 | 2 | 2 | 26 | 21 |
| 20 | condition_20 | 13 | 1 | 30-34 | 2 | 1 | 1 | 6-10 | 1 | 2 | 27 | 25 |
| 21 | condition_21 | 13 | 2 | 35-39 | 2 | 2 | 1 | 6-10 | 2 | 2 | 26 | 21 |
| 22 | condition_22 | 14 | 1 | 65-69 | 2 | 2 | 1 | NaN | 2 | 2 | 29 | 28 |
| 23 | condition_23 | 16 | 1 | 30-34 | 2 | 2 | 1 | 16-20 | 2 | 2 | 29 | 23 |
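Note how the condition group table displays whole numbers even though the columns are floats: that's the display.float_format option set earlier. A minimal sketch of the mechanism, showing that the underlying dtype is untouched:

```python
import pandas as pd
import numpy as np

# a float column, as pandas stores any integer column containing NaN
df = pd.DataFrame({'madrs1': [19.0, np.nan]})

pd.options.display.float_format = '{:,.0f}'.format  # render floats with no decimals
print(df)                  # displays 19, not 19.0
print(df['madrs1'].dtype)  # float64: the underlying data is unchanged
pd.reset_option('display.float_format')
```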
We could remove the all-NaN columns from the control group DataFrame since we won't use them in the analysis, but we can also keep them so its columns match the condition group DataFrame.
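If we did want to drop them, dropna with how='all' removes only the columns that are entirely missing; a sketch on toy data:

```python
import pandas as pd
import numpy as np

# toy control rows: madrs1 is entirely missing, the other columns are populated
toy_control = pd.DataFrame({
    'number': ['control_1', 'control_2'],
    'days':   [8, 20],
    'madrs1': [np.nan, np.nan],
})

# how='all' drops only all-NaN columns; partially filled columns survive
trimmed = toy_control.dropna(axis=1, how='all')
print(list(trimmed.columns))  # ['number', 'days']
```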
As mentioned previously, the condition group has missing data. It doesn't appear to depend on the data in other columns, so it may be missing completely at random (MCAR). We can use different types of plots to look for correlations among the missing values.
With only two columns containing missing data, the following missingno heatmap is not too informative, but it could be useful for larger datasets with more missing data.
msno.heatmap(condition_scores, figsize=(6,5))
Since the goal in identifying the type of missing data is to determine if and how it might affect future analyses, let's use seaborn to plot the scores we'll be using in the analysis (MADRS 1 and 2) on the X and Y axis, then see if the missing melancholia values correlate in any way. We can use the color and size to differentiate the NA values, as well as style to mark the education range values.
# dataframe for purpose of plotting, with filled NA values
cond_NA = condition_scores.fillna('NA')
# plot melancholia to look at NA values
sns.relplot(x=cond_NA['madrs1'], y=cond_NA['madrs2'], style=cond_NA['edu'], size=cond_NA['melanch'], \
size_order=['NA', 2.0, 1.0], sizes=(60, 200), hue=cond_NA['melanch'], hue_order=[1.0, 2.0, 'NA'], \
palette='icefire', alpha=0.8, height=6).set(title='Melancholia with respect to MADRS scores')
Other than 2 of the 3 NA points overlapping other points, the figure doesn't show a particular correlation, e.g. that all the missing melancholia scores have the same MADRS 1 or 2 scores, or that all three have the same education range value. As a result, we can move forward with our analysis and treat the missing condition group data as MCAR.
Now that we've organized the scores data, we need to load all the actigraph data. In each part, after curating the control group data, we will repeat the process for the condition group. To store the timestamps and activity data, we'll make a control list and a condition list, and fill them with a DataFrame for each csv file (each participant). That way we can easily index each list by participant. In addition, we'll map each participant number to their row count (number of actigraph recordings), which will help later when selecting a range of data to plot. There is also a check in the code that prints a message if there are any missing values in the actigraph data. Lastly, for plotting, we'll also create a DataFrame with columns of just the activity data.
# get a numerically sorted list of csv filenames in the control folder (data/control is a relative path);
# glob's order is OS-dependent, so sort by participant number to keep list indices consistent
control_filenames = sorted(glob.glob(os.path.join('data/control', '*.csv')),
                           key=lambda f: int(os.path.basename(f).split('.')[0].split('_')[1]))
# use list comprehension to create list of DataFrames from csv files
# select timestamp and activity columns (date is repetitive) and parse dates so they are converted to DateTime objects
control_group = [pd.read_csv(filename, parse_dates=['timestamp'])[['timestamp', 'activity']] for filename in control_filenames]
# print head of first DataFrame in list to show how data is stored
display(control_group[0].head())
# map each participant number to their row count, and print a message for any NA values
control_rows = {}
for i, df in enumerate(control_group):
    if df.isna().any(axis=None): print('Control {} has NaN values'.format(i+1))
    control_rows[i+1] = len(df)
print('\n\nMinutes of activity collected for each participant in the control group:\n\n', control_rows)
minv = min(control_rows, key=control_rows.get)
print('\nGroup with least activity data: {} ({})'.format(minv, control_rows[minv]))
| | timestamp | activity |
|---|---|---|
| 0 | 2003-11-11 09:00:00 | 9 |
| 1 | 2003-11-11 09:01:00 | 7 |
| 2 | 2003-11-11 09:02:00 | 7 |
| 3 | 2003-11-11 09:03:00 | 7 |
| 4 | 2003-11-11 09:04:00 | 7 |
Minutes of activity collected for each participant in the control group:
{1: 29038, 2: 21824, 3: 51589, 4: 31936, 5: 65407, 6: 20490, 7: 24772, 8: 22257, 9: 24433, 10: 34637, 11: 27609, 12: 31891, 13: 21688, 14: 51619, 15: 21655, 16: 24553, 17: 28918, 18: 21669, 19: 21818, 20: 33365, 21: 51490, 22: 31473, 23: 24614, 24: 31455, 25: 28905, 26: 22259, 27: 22263, 28: 47133, 29: 51611, 30: 24536, 31: 51389, 32: 21694}
Group with least activity data: 6 (20490)
# get a numerically sorted list of csv filenames in the condition folder (data/condition is a relative path);
# glob's order is OS-dependent, so sort by participant number to keep list indices consistent
condition_filenames = sorted(glob.glob(os.path.join('data/condition', '*.csv')),
                             key=lambda f: int(os.path.basename(f).split('.')[0].split('_')[1]))
# use list comprehension to create list of DataFrames from csv files
# select timestamp and activity columns (date is repetitive) and parse dates so they are converted to DateTime objects
condition_group = [pd.read_csv(filename, parse_dates=['timestamp'])[['timestamp', 'activity']] for filename in condition_filenames]
# print head of first DataFrame in list to show how data is stored
display(condition_group[0].head())
# map each participant number to their row count, and print a message for any NA values
condition_rows = {}
for i, df in enumerate(condition_group):
    if df.isna().any(axis=None): print('Condition {} has NaN values'.format(i+1))
    condition_rows[i+1] = len(df)
print('\n\nMinutes of activity collected for each participant in the condition group:\n\n', condition_rows)
minv = min(condition_rows, key=condition_rows.get)
print('\nGroup with least activity data: {} ({})'.format(minv, condition_rows[minv]))
| | timestamp | activity |
|---|---|---|
| 0 | 2003-05-07 15:00:00 | 1468 |
| 1 | 2003-05-07 15:01:00 | 1006 |
| 2 | 2003-05-07 15:02:00 | 468 |
| 3 | 2003-05-07 15:03:00 | 306 |
| 4 | 2003-05-07 15:04:00 | 143 |
Minutes of activity collected for each participant in the condition group:
{1: 38926, 2: 25910, 3: 23244, 4: 21347, 5: 21493, 6: 21231, 7: 21531, 8: 21772, 9: 21555, 10: 25847, 11: 21829, 12: 21648, 13: 31485, 14: 21646, 15: 22175, 16: 22147, 17: 41847, 18: 20318, 19: 22990, 20: 19299, 21: 21433, 22: 21556, 23: 20487}
Group with least activity data: 20 (19299)
control_filenames = sorted(glob.glob(os.path.join('data/control', '*.csv')),
                           key=lambda f: int(os.path.basename(f).split('.')[0].split('_')[1]))
# store the activity data for the control group as columns in a DataFrame
group = 1
control_activity = pd.DataFrame()
for filename in control_filenames:
    file_df = pd.read_csv(filename, parse_dates=['timestamp'])[['timestamp', 'activity']]
    control_activity[str(group)] = file_df['activity']
    group += 1
display(control_activity[0:5])
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9 | 0 | 19 | 0 | 0 | 1,221 | 5 | 3 | 219 | 3 | ... | 82 | 30 | 3 | 7 | 6 | 0 | 60 | 3 | 321 | 0 |
| 1 | 7 | 0 | 97 | 0 | 0 | 667 | 3 | 3 | 712 | 2 | ... | 0 | 0 | 0 | 5 | 4 | 3 | 0 | 0 | 115 | 0 |
| 2 | 7 | 0 | 586 | 0 | 0 | 469 | 3 | 3 | 1,076 | 0 | ... | 0 | 151 | 0 | 5 | 4 | 3 | 264 | 7 | 240 | 0 |
| 3 | 7 | 0 | 1183 | 0 | 0 | 349 | 3 | 3 | 948 | 0 | ... | 0 | 156 | 0 | 5 | 4 | 0 | 662 | 0 | 888 | 0 |
| 4 | 7 | 0 | 266 | 0 | 0 | 178 | 3 | 3 | 1,042 | 0 | ... | 0 | 1509 | 0 | 5 | 4 | 0 | 293 | 0 | 1378 | 0 |
5 rows × 32 columns
condition_filenames = sorted(glob.glob(os.path.join('data/condition', '*.csv')),
                             key=lambda f: int(os.path.basename(f).split('.')[0].split('_')[1]))
# store the activity data for the condition group as columns in a DataFrame
group = 1
condition_activity = pd.DataFrame()
for filename in condition_filenames:
    file_df = pd.read_csv(filename, parse_dates=['timestamp'])[['timestamp', 'activity']]
    condition_activity[str(group)] = file_df['activity']
    group += 1
display(condition_activity[0:5])
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1468 | 0 | 0 | 0 | 0 | 249 | 0 | 111 | 91 | 97 | ... | 0 | 5 | 0 | 0 | 5 | 0 | 3 | 0 | 510 | 0 |
| 1 | 1006 | 0 | 143 | 0 | 0 | 69 | 0 | 66 | 0 | 498 | ... | 0 | 3 | 0 | 3 | 5 | 53 | 190 | 0 | 637 | 349 |
| 2 | 468 | 0 | 0 | 0 | 0 | 116 | 0 | 157 | 0 | 249 | ... | 0 | 3 | 0 | 0 | 5 | 0 | 8 | 0 | 598 | 111 |
| 3 | 306 | 0 | 20 | 0 | 0 | 258 | 0 | 73 | 0 | 396 | ... | 0 | 3 | 3 | 0 | 5 | 0 | 3 | 0 | 251 | 38 |
| 4 | 143 | 0 | 166 | 0 | 0 | 152 | 0 | 142 | 0 | 209 | ... | 0 | 283 | 0 | 0 | 5 | 0 | 14 | 0 | 93 | 3 |
5 rows × 23 columns
As the comments in the code describe, we parse the timestamps while reading the csv so they'll be formatted as DateTime objects, which may come in handy later on in the tutorial. So far, we've prepared the data in a couple different ways for later use: a list of DataFrames, a DataFrame each for the control and condition group with the Actigraph data as columns, and we've also stored the row counts (number of Actigraph recordings) for each participant in a dictionary.
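As a quick illustration of why parsed DateTime timestamps come in handy, here's a sketch of aggregating minute-level activity to hourly averages with resample (on synthetic data mirroring the structure above, not the real recordings):

```python
import pandas as pd

# two hours of synthetic minute-level activity, indexed by timestamp
idx = pd.date_range('2003-11-11 09:00', periods=120, freq='min')
activity = pd.Series(range(120), index=idx, name='activity')

# with a DatetimeIndex, downsampling to hourly means is a one-liner
hourly = activity.resample('60min').mean()
print(hourly)
```

The same pattern applies to any of the per-participant DataFrames once the timestamp column is set as the index.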
Now, let's use matplotlib to get a visual idea of the data we're working with.
plt.figure(figsize=(20, 16))
# this loop can be adjusted to plot a different number of plots
num = 1
for row in range(1):
    for col in range(1, 2):
        # plot
        ax = plt.subplot2grid((6, 4), (row, col))
        ax.scatter(control_group[num]['timestamp'], control_group[num]['activity'], alpha=0.1, s=0.1)
        # rotate the date tick labels and format them as year-month
        ax.tick_params(axis='x', rotation=45)
        ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
        # set title and axis labels
        ax.set(title='Control ' + str((row*4)+(col+1)), xlabel='Date (year and month)', ylabel='Activity (intensity)')
        num += 1
plt.figure(figsize=(20, 16))
# this loop can be adjusted to plot a different number of plots
num = 1
for row in range(1):
    for col in range(1, 3):
        # plot
        ax = plt.subplot2grid((6, 4), (row, col))
        ax.scatter(condition_group[num]['timestamp'], condition_group[num]['activity'], alpha=0.1, s=0.5)
        # rotate the date tick labels and format them as year-month
        ax.tick_params(axis='x', rotation=45)
        ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
        # set title and axis labels
        ax.set(title='Condition ' + str((row*4)+(col+1)), xlabel='Date (year and month)', ylabel='Activity (intensity)')
        num += 1
We can see from these 3 samples that the Actigraph data is spread over 1 or 2 months, and there is a lot of variation in activity intensity over time. The variation might be something interesting to explore in the next section.
The subplots have their own y-axis tick values, so we need to look carefully at the ranges if we want to compare between samples. From this initial glance at the data, we can see the upper range of the control participant's activity reaches about 2,000 (excluding outliers). Conditions 2 and 3 both have a high density of points under 1,000, with scattered points and spikes reaching 3,000-3,500. Although we can't make any assumptions about the groups as a whole since we're only looking at three participants, we've seen there are certain days or weeks within the recorded months where activity spikes, and that intensity values fall roughly in the 0-8,000 range.
For the first step in the exploratory data analysis, let's see if we can better understand the range and other statistics of each group using boxplots. It's important to acknowledge that the size of our actigraph datasets (up to roughly 50,000 rows per participant) can create some difficulty when plotting. For example, we saw earlier that there are frequent spikes in activity intensity, so we can expect many outliers. This may affect the decision of whether or not to display the fliers (points past the ends of the whiskers, which represent the outliers). In fact, showfliers=False reduced the file size of the .ipynb notebook version of this tutorial by 15 MB, so in the end the fliers had to go. For the following boxplots, we'll focus on the other information illustrated, such as medians and quartile ranges.
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# plot control (and format)
control_activity_plot = control_activity.boxplot(figsize=(6, 3), grid=False, \
boxprops=dict(color="r"), flierprops=dict(marker='.', markeredgecolor='darkred', markersize=0.5, alpha=0.5), \
whiskerprops=dict(color='r'), medianprops=dict(color='purple'), ax=axes[0], showfliers=False)
control_activity_plot.set(title='Control Group Activity', xlabel='Participant', ylabel='Activity (intensity)')
axes[0].set_ylim(-100, 1800)
# plot condition (and format)
condition_activity_plot = condition_activity.boxplot(figsize=(6, 3), grid=False, \
flierprops=dict(marker='.', markeredgecolor='b', markersize=0.5, alpha=0.5), ax=axes[1], showfliers=False)
condition_activity_plot.set(title='Condition Group Activity', xlabel='Participant', ylabel='Activity (intensity)')
axes[1].set_ylim(-100, 1800)
fig.tight_layout()
plt.show()
From the plots, we can tell the median activity for both groups is quite low; no participant seems to have a median activity much above 100. The control group generally has longer whiskers, likely because there are more outliers, representing more spikes in activity intensity. All of the 3rd quartiles in the condition group plot, and most in the control group plot, are less than 500, indicating that 75% of the activity data falls below 500.
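The quartile reading above can also be checked numerically: by definition, at least 75% of a column's values fall at or below its quantile(0.75). A toy sketch:

```python
import pandas as pd

# toy activity column with a large outlier, like the spikes we saw earlier
col = pd.Series([0, 10, 50, 120, 300, 450, 480, 1600])

q3 = col.quantile(0.75)           # 3rd quartile (linear interpolation)
share_below = (col <= q3).mean()  # fraction of values at or below it
print(q3, share_below)
```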
Next, let's look at the sum of activity data for each participant. This discards any consideration of variation (which we'll look at soon), but it will be interesting to learn whether there's any correlation between total activity over time and MADRS scores. One way to approach the sums is to select a fixed range of activity data; otherwise we'd be adding different amounts of activity data for each participant, since the date ranges differ. For that route, we'd use the minimum number of rows we found previously, 19,299. However, some participants have over 50,000 rows of activity data, so selecting only the first 19,299 rows doesn't account for the case where a participant goes from one extreme to another partway through, i.e. from being very active to barely active. Because of that, let's normalize the activity sums instead: we'll sum all the activity data and divide by the number of rows, to get the average activity value for each participant. This way, we consider the entire timeframe over which each participant's activity was recorded.
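As a sanity check, summing and dividing by the non-null row count is equivalent to pandas' mean(), which skips NaN by default; a sketch on toy columns of unequal length:

```python
import pandas as pd
import numpy as np

# toy activity frame: column '1' is shorter, so pandas pads it with NaN
act = pd.DataFrame({'1': [10, 20, 30, np.nan], '2': [5, 15, 25, 35]})

manual = act.sum(axis=0) / act.count()  # sum divided by non-null count
direct = act.mean()                     # identical: NaNs are skipped by default
print(manual.equals(direct))  # True
```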
# get list of control columns, compute average for each column, then print
control_activity_cols = list(control_activity)
control_col_avgs = list(control_activity[control_activity_cols].sum(axis=0)/control_activity[control_activity_cols].count())
print('Average activity for each participant in control group:\n\n', control_col_avgs)
Average activity for each participant in control group: [107.26665059577105, 351.91458944281527, 266.9782009780288, 188.41500792065568, 173.4560575797231, 314.0471937530503, 162.98122880671727, 168.27505953183268, 237.9501903163754, 110.05826847579034, 169.430330689268, 251.2608995109856, 245.2155108815935, 121.13086300709415, 280.40909720618794, 205.88689773143813, 290.4881042949028, 219.45890442567725, 295.3337152809607, 264.74622907913766, 286.3908671396102, 423.75318548109374, 241.54099293085235, 148.96425373648322, 176.769001902785, 343.7215059077227, 278.20203925796164, 213.62890006198774, 173.96680212135823, 293.4336892729051, 259.0020318203733, 189.766847976399]
# get list of condition columns, compute average for each column, then print
condition_activity_cols = list(condition_activity)
condition_col_avgs = list(condition_activity[condition_activity_cols].sum(axis=0)/condition_activity[condition_activity_cols].count())
print('Average activity for each participant in condition group:\n\n', condition_col_avgs)
Average activity for each participant in condition group: [153.66474849714842, 221.34874565804708, 146.94802959903632, 71.10408956762075, 167.24598706555622, 157.24021478027413, 85.84218104128931, 161.75606283299652, 289.64722802134077, 54.697992030022824, 109.53406019515323, 265.29970436068, 202.61908845481975, 75.04989374480273, 261.8055016910936, 151.4448458030433, 152.3616092072137, 178.7733044591003, 129.38303610265334, 184.93445256230893, 196.41641394111883, 274.8669975876786, 79.48006052618734]
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# plot control group, set title and axis labels and limit
ax = sns.barplot(x=control_activity_cols, y=control_col_avgs, palette='Spectral', ax=axes[0])
ax.set(title='Actigraph Data Averages for Control Group', xlabel='Participant', ylabel='Activity')
ax.set_ylim(top=500)
# plot condition group, set title and axis labels and limit
ax = sns.barplot(x=condition_activity_cols, y=condition_col_avgs, palette='Spectral', ax=axes[1])
ax.set(title='Actigraph Data Averages for Condition Group', xlabel='Participant', ylabel='Activity')
ax.set_ylim(top=500)
plt.show()
The averages for the condition group tend to be lower than those for the control group. There also appear to be more distinct levels of variation among condition group participants (i.e. high, medium, low). A question to explore further is whether these differences can be linked to different affiliation types (1: bipolar II, 2: unipolar depressive, 3: bipolar I).
In the condition group, participants 4, 7, 10, 14, and 23 have smaller activity averages; when we analyze the correlation between activity and MADRS scores in the next step, I'd expect these participants to have higher MADRS scores, indicating more severe depression (unless we end up unable to reject the null hypothesis). Since we've averaged the activity data over time, we'll also average the MADRS 1 and 2 scores (the scores when activity measurement started and stopped).
# dataframe for purpose of plotting, with filled NA values
cond_NA = condition_scores.fillna('NA')
# plot average activity levels with average MADRS scores
# plot average activity levels with average MADRS scores
madrs_avgAct = sns.relplot(x=condition_col_avgs, y=(condition_scores['madrs1']+condition_scores['madrs2'])/2, \
                           height=7, hue=condition_scores['afftype'], s=100, palette='viridis')
# format
madrs_avgAct.set(title='Actigraph and MADRS Score Averages', xlabel='Average Activity', ylabel='MADRS Score')
# add point labels for participants
for num in condition_activity_cols:
    plt.text(condition_col_avgs[int(num)-1] + 4, (condition_scores['madrs1'][int(num)]+ \
             condition_scores['madrs2'][int(num)])/2, num, horizontalalignment='left', size='small')
plt.show()
The participants appear to be clustered into two groups, with scores of 0-20 and 22-30. Almost all of those I expected to have high scores based on low activity do, except participant 4. This first visualization of the tutorial shows that, for the condition group in particular, activity could affect the MADRS score; it looks like we could be on the right track. However, the plot is less convincing at higher activity levels: of participants 9, 12, 15, and 22 (the highest average activity levels from the previous step), only one falls in the lower score group as we might expect.
We know from the condition group scores that a majority of participants have affiliation type 2 (unipolar depressive). There is a pretty even split in affiliation types between the lower and higher score range. This isn't surprising given that we've taken averages of the scores and activity. In fact, it brings us to our next step: visualizing the variation and seeing if there's a correlation between the variation in activity, and the MADRS score or affiliation type.
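Before plotting, the variance/score relationship can also be summarized with a single rank-correlation number. A minimal sketch, with illustrative stand-in values rather than the notebook's actual series (there, x would be condition_activity.var(axis=0) and y would be condition_scores['madrs1']):

```python
import numpy as np
from scipy import stats

# Illustrative stand-ins: in the notebook, x would be condition_activity.var(axis=0)
# and y would be condition_scores['madrs1']
variance = np.array([52000., 31000., 78000., 15000., 64000., 23000.])
madrs1 = np.array([19, 24, 17, 26, 18, 23])

# Spearman's rho is rank-based, so it tolerates the heavy skew in activity variance
rho, p = stats.spearmanr(variance, madrs1)
print('Spearman rho = {:.2f}, p = {:.3f}'.format(rho, p))
```

A rank correlation like Spearman's is a reasonable complement to the loess plots below, since raw activity variance is far from normally distributed.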
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14, 6))
# plot activity variance for MADRS 1 and 2
madrs1 = sns.regplot(x=condition_activity.var(axis=0), y=condition_scores['madrs1'], \
                     ax=ax1, fit_reg=True, lowess=True).set(title='Variance in Activity (MADRS 1)', xlabel='Variance')
madrs2 = sns.regplot(x=condition_activity.var(axis=0), y=condition_scores['madrs2'], ax=ax2, \
                     fit_reg=True, lowess=True).set(title='Variance in Activity (MADRS 2)', xlabel='Variance')
plt.show()
The variance plots are not what we would hope for, since they show no obvious trend or cluster in the data. The loess curve, a local regression in which each fitted point is weighted toward the nearest data points, shows some relation between the MADRS 1 and MADRS 2 scores, but not much else. It's clear there's no linear relationship, so it was worth trying the loess option to see whether we could observe some relation between variance and MADRS score. It does appear that participants with activity in the middle of the variance range tend to have lower MADRS scores. One possible explanation is that participants with more regulated daily/weekly activity levels experience less severe depression.
Even though I hoped the variance would show some correlation with the MADRS scores, I think its absence can be explained by sleep schedules and daily routines. Activity for every participant will likely decrease during the nighttime hours, so analyzing the overall variance is probably not the best way to predict MADRS scores or depression states. It would be interesting to explore this idea further and do something like the code below, to see if there is a correlation when plotting only certain times of the day. Since we know the circadian rhythm can be affected by a depressive episode, it may also be worthwhile to plot activity in the middle of the night.
filtered = []
condition_activity_filt = pd.DataFrame()
# get activity data within a certain time window
for i, num in enumerate(condition_group):
    num = num.set_index(['timestamp'])
    filtered.append(num.between_time('08:00:00', '20:00:00'))
    condition_activity_filt[i+1] = num['activity'].between_time('08:00:00', '20:00:00')
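To make the filtering behavior concrete, here is a tiny self-contained example of what between_time keeps, on a toy hourly frame rather than the study data. Note that when the start time is later than the end time, the window wraps past midnight, which is exactly what we'd want for the middle-of-the-night idea above:

```python
import numpy as np
import pandas as pd

# Toy frame: one reading per hour for two days (illustrative, not the study data)
idx = pd.date_range('2020-01-01', periods=48, freq='h')
toy = pd.DataFrame({'activity': np.arange(48)}, index=idx)

# between_time keeps rows whose clock time falls in the window (endpoints included)
daytime = toy.between_time('08:00:00', '20:00:00')
# when start > end, the window wraps past midnight -- handy for overnight activity
overnight = toy.between_time('23:00:00', '05:00:00')
print(len(daytime), len(overnight))
```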
So far we've compared the activity data between the control and condition groups, and we've also looked for correlations between activity and MADRS scores within the condition group. Through the scatterplots above, variance has proved unlikely to be a good predictor of depression states or MADRS scores.
The next step is to figure out if we can reject the null hypothesis: that there is no relation between activity data and depression states/MADRS scores. In the following two segments, we'll train two machine learning models to predict the affiliation type (1: bipolar II, 2: unipolar depressive, 3: bipolar I) based on the activity data. The activity data will therefore be the independent variable, and the affiliation type the dependent variable. We'll treat each participant's activity data as a feature group so that the model can learn from multiple participants' activity. We also need to select the rows of activity data to use, because the models need to be trained on same-sized data. 19299 is the minimum number of rows across the condition group participants, so we use that value.
The two machine learning algorithms we'll use are Random Forest Classifier and Logistic Regression from sklearn. Random Forest is an ensemble of decision trees, which will help classify a participant's affiliation based on their activity. Since we're effectively working with affiliation types 1 and 2 (there is only one 3), we can use logistic regression to see if there's a relationship between affiliation type and activity. We'll use a test size of 7, since that's about 1/3 of the condition group.
The n_estimators parameter is the number of trees in the forest; generally the larger it is the better the prediction, but larger values will also cause increased computation time.
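To see that trade-off in isolation, here's a quick toy run on synthetic data (not the study data; the exact scores depend on the random seed) comparing a forest of 1, 10, and 1000 trees:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy illustration of the n_estimators trade-off on synthetic data
# (not the study data; exact scores depend on the random seed)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

scores = []
for n in (1, 10, 1000):
    forest = RandomForestClassifier(n_estimators=n, random_state=0).fit(Xtr, ytr)
    scores.append(forest.score(Xte, yte))
    print(n, 'trees ->', scores[-1])
```

Typically the accuracy gain flattens out well before 1000 trees, while training time keeps growing roughly linearly with n_estimators.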
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

X = []
y = condition_scores['afftype']
# add activity levels as independent variable
for x in condition_group:
    X.append(x['activity'][0:19299])
# train and predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=7, random_state=73)
clf = RandomForestClassifier(n_estimators=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# create confusion matrix
cm_rfc = metrics.confusion_matrix(y_test, y_pred)
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression()
X = []
y = condition_scores['afftype']
# add activity levels as independent variable
for x in condition_group:
    X.append(x['activity'][0:19299])
# train and predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=7, random_state=73)
logisticRegr.fit(X_train, y_train)
y_pred = logisticRegr.predict(X_test)
# create confusion matrix
cm_lr = metrics.confusion_matrix(y_test, y_pred)
Now that we've trained the models, let's plot the confusion matrix for each so we can see how accurately the models predicted the affiliation type.
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14, 6))
# formatting labels
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_percentages = ['{0:.0%}'.format(value) for value in cm_rfc.flatten()/np.sum(cm_rfc)]
labels = [f'{v1}\n{v2}' for v1, v2 in zip(group_names, group_percentages)]
labels = np.asarray(labels).reshape(2,2)
# plot random forest classifier confusion matrix
rfc = sns.heatmap(cm_rfc, annot=labels, fmt='', linewidths=.5, square = True, cmap = 'Blues', ax=ax1)
rfc.set(xlabel='Predicted', xticklabels=(1, 2), ylabel='Actual', yticklabels=(1, 2), \
        title='Random Forest Classifier\nAccuracy Score: {0:.4f}'.format(clf.score(X_test, y_test)))
# formatting labels
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_percentages = ['{0:.0%}'.format(value) for value in cm_lr.flatten()/np.sum(cm_lr)]
labels = [f'{v1}\n{v2}' for v1, v2 in zip(group_names, group_percentages)]
labels = np.asarray(labels).reshape(2,2)
# plot logistic regression confusion matrix
lr = sns.heatmap(cm_lr, annot=labels, fmt='', linewidths=.5, square = True, cmap = 'Blues', ax=ax2)
lr.set(xlabel='Predicted', xticklabels=(1, 2), ylabel='Actual', yticklabels=(1, 2), \
       title='Logistic Regression\nAccuracy Score: {0:.4f}'.format(logisticRegr.score(X_test, y_test)))
plt.show()
Considering we only looked at a subset of the activity data, a .7143 accuracy score is better than I was expecting. We shouldn't base our analysis on the accuracy score alone, though, because it doesn't give enough information to fully evaluate the model. Looking at the heatmaps, we can see the logistic regression has a true positive rate of 71% for correctly classifying type 2 (unipolar depressive). This could indicate that the activity of participants with unipolar depression is more easily distinguished than that of participants with other affiliation types. It could also be that, because there are more participants with unipolar depression, there's a better chance they'll be identified correctly based on the training data. However, if the prediction were driven purely by the frequency of each affiliation type, we might expect to see more false negatives as well.
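One way to go beyond the raw confusion matrix is a per-class precision/recall breakdown. The sketch below uses hypothetical label vectors of the same 7-sample test size discussed above (stand-ins for the actual y_test and y_pred, which live in the notebook):

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels with the same 7-sample test size discussed above
# (stand-ins for the actual y_test / y_pred from the models)
y_true_demo = np.array([2, 2, 2, 1, 2, 1, 2])
y_pred_demo = np.array([2, 2, 2, 2, 2, 1, 1])

# Per-class precision and recall give a fuller picture than accuracy alone
report = metrics.classification_report(y_true_demo, y_pred_demo,
                                       output_dict=True, zero_division=0)
print('accuracy: {:.4f}'.format(report['accuracy']))
print('class 2 precision: {:.2f}, recall: {:.2f}'.format(
    report['2']['precision'], report['2']['recall']))
```

Running the same report on the real y_test and y_pred would tell us whether the models are genuinely separating the classes or mostly defaulting to the majority class.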
I did attempt to predict MADRS scores based on activity, but it was a challenging endeavor, and the accuracy was only 0 or 20%. I think we would need a machine learning algorithm that performs a much more complex and in-depth evaluation of the activity data in order to predict MADRS scores accurately. Because of the large variation in activity over time, we would need a way to analyze each change in activity. Perhaps counting the number of spikes and measuring their heights could be one path forward. We might expect those with higher MADRS scores to have fewer, lower spikes because they're less active. However, this may not be the case; different affiliation types could affect the variation in activity. Those with bipolar depression might also experience manic episodes while activity is being measured, which would cause a quick increase in activity intensity. For that reason, a better machine learning model should probably use both affiliation type and activity data as feature groups.
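The spike-counting idea could start from something like scipy's find_peaks. Here's a sketch on a synthetic activity trace, where the height threshold is a placeholder that would need tuning on the real actigraph data:

```python
import numpy as np
from scipy.signal import find_peaks

# Synthetic activity trace: low baseline with four bursts of movement
# (the height threshold below is a placeholder to tune on real activity data)
rng = np.random.default_rng(0)
activity = rng.poisson(5, size=500).astype(float)
activity[[50, 180, 320, 410]] += [300, 250, 400, 275]

# count spikes above the threshold and record their heights
peaks, props = find_peaks(activity, height=100)
print('spike count:', len(peaks))
print('mean spike height: {:.1f}'.format(props['peak_heights'].mean()))
```

The spike count and mean spike height per participant could then become two compact features to feed into the classifiers, alongside affiliation type.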
In a final effort to find more meaningful results, we'll use sklearn's HuberRegressor and evaluate the model with 10-fold cross validation. The reason for using this model is that its loss function is quadratic for small errors and linear (absolute) for large ones, meaning the prediction is less influenced by outliers. Since all of the spikes in activity create a lot of outliers, using HuberRegressor with a more robust form of cross validation may give us a more accurate error analysis.
from numpy import absolute, mean, std
from sklearn.linear_model import HuberRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# set up repeated k-fold cross validation
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# get the error using the Huber regressor and cross validation
error = cross_val_score(HuberRegressor(), X, y, scoring='neg_mean_absolute_error', cv=cv)
# display mean MAE and standard deviation of the error
print('Mean mean-absolute-error: {:.4f}, std_dev: {:.4f}'.format(mean(absolute(error)), std(error)))
Mean mean-absolute-error: 0.8990, std_dev: 0.3406
Even though a mean absolute error of nearly 0.9, on labels that only range from 1 to 3, is not too encouraging, it does illustrate how tough it is to predict something as complex as mental health. For this tutorial we are not able to reject the null hypothesis, because during hypothesis testing we could not provide enough evidence, via the machine learning models, that depression states or MADRS scores can be predicted accurately from Actigraph data. However, that does not mean there is no relation, or that it will never be possible to predict depression states accurately from Actigraph data. As mentioned in the introduction, Actigraph recordings are considered an objective method of observing depression, so we just need a way to build a better machine learning model.
Some of the limitations of the models during hypothesis testing stemmed from the data itself. During the exploratory data analysis and hypothesis testing, we discussed some avenues for further exploring and analyzing the data, which may help to find better feature groups and parameters to pass to the machine learning algorithms.
Overall, we were able to make distinctions between the control and condition group's activity, but not predict depression states based on that activity. We looked at the averages and variation in activity, as well as used confusion matrices (heatmaps) to display the somewhat promising predictions from the Random Forest Classifier and Logistic Regression models. Since the model is only as good as the data it's given, the next step to predicting depression states more accurately would be to explore the Actigraph data more.